This is a very short report because time is running out. The CSS is also not ideal, but it's what I had ready.
Here is a heatmap showing the correlation matrix. The visible patterns suggest relationships between the variables; some features appear quite similar to each other.
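A heatmap like the one above can be built from the Pearson correlation matrix of the data frame. A minimal sketch with toy data (the column names here are made up, not the real dataset's):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200)})
df["b"] = 0.8 * df["a"] + rng.normal(scale=0.5, size=200)  # correlated with "a"
df["c"] = rng.normal(size=200)                             # independent

corr = df.corr()  # Pearson correlation matrix
# The heatmap itself can be drawn with seaborn:
# import seaborn as sns; sns.heatmap(corr, cmap="coolwarm", center=0)
print(corr.round(2))
```

Blocks of similar columns show up as bright off-diagonal patches, which is what hints that some features carry overlapping information.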
Here is the distribution of the target variable. There is not much to say: it looks roughly log-normal.
mean: 69.399, std: 67.241
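The log-normal impression could be checked by fitting that distribution to the target. A minimal sketch, using a toy sample in place of the real target values (the parameters below are assumptions, not fitted to the actual data):

```python
import numpy as np
from scipy import stats

# Toy stand-in for the target variable.
rng = np.random.default_rng(0)
y = rng.lognormal(mean=3.8, sigma=0.8, size=2000)

print(f"mean: {y.mean():.3f}, std: {y.std():.3f}")

# Fit a log-normal (location fixed at 0) and compare the shape parameter.
shape, loc, scale = stats.lognorm.fit(y, floc=0)
print(f"fitted sigma = {shape:.3f}, scale = {scale:.3f}")
```

Overlaying the fitted density on the histogram would make the visual comparison concrete.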
Many features show a significant correlation with the target. I will keep only the most correlated ones for training the model.
This is, however, a very crude method of feature selection: many features may be strongly correlated with each other and therefore add no new information, and features with no linear correlation could still have a non-linear dependency on the target.
| Most correlated features | corr |
|---|---|
| soil_grids_soc_5_15 | 0.518 |
| soil_grids_soc_0_5 | 0.496 |
| soil_grids_ocd_5_15 | 0.491 |
| soil_grids_nitrogen_0_5 | 0.480 |
| soil_olm_soc_b0 | 0.477 |
| soil_grids_ocd_0_5 | 0.464 |
| soil_olm_soc_b10 | 0.455 |
| soil_grids_ocs_0_30 | 0.449 |
| LST_Day_1km_09_mean | -0.429 |
| soil_grids_soc_15_30 | 0.424 |
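Ranking features by absolute correlation with the target, as in the table above, can be sketched like this (the helper name and the toy columns are mine, not the report's):

```python
import numpy as np
import pandas as pd

def top_correlated(df: pd.DataFrame, target: str, k: int = 10) -> pd.Series:
    """Return the k features with the largest |Pearson correlation| to the target."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.reindex(corr.abs().sort_values(ascending=False).index).head(k)

# Toy frame; the real column names (soil_grids_*, etc.) are not reproduced here.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 3)), columns=["f1", "f2", "f3"])
df["target"] = 2.0 * df["f1"] - 1.0 * df["f2"] + rng.normal(scale=0.5, size=300)

print(top_correlated(df, "target", k=2))
```

Note that sorting by absolute value is what lets negatively correlated features (like LST_Day_1km_09_mean above) make the cut.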
I created a very simple pipeline: it selects the given features, applies a standard scaler, and trains the model.
The model is trained on train_df and tested on test_df, which are produced by a random split. No cross-validation or other fancy stuff.
Some evaluation metrics are also computed: RMSE, MAE, and R².
Here is the output for two example models, linear regression and random forest:
Linear Regression
rmse: 53.925, mae: 31.834, r2: 0.366
Random Forest
rmse: 52.310, mae: 31.278, r2: 0.403
We can see that the random forest performs slightly better.
Both RMSEs are below the target's standard deviation, but far from 0.
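Comparing the RMSE against the target std is a useful sanity check: a constant predictor that always outputs the mean achieves an RMSE equal to the std on the same data, so any model worth keeping must beat it. A minimal sketch (the toy target reuses the reported moments, not the real values):

```python
import numpy as np

# Toy target with roughly the reported mean and std.
rng = np.random.default_rng(0)
y = rng.normal(loc=69.4, scale=67.2, size=10_000)

# Baseline: always predict the mean. Its RMSE is the (population) std of y.
baseline_rmse = np.sqrt(np.mean((y - y.mean()) ** 2))
print(f"baseline rmse = {baseline_rmse:.3f}, std = {y.std():.3f}")
```

By this yardstick, an RMSE around 53 against a std around 67 corresponds to a real but modest improvement over the mean baseline, consistent with the R² values of 0.37 and 0.40.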